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Abstract 

Background: The Gene Ontology project integrates data about the function of gene products across a diverse 
range of organisms, allowing the transfer of knowledge from model organisms to humans, and enabling 
computational analyses for interpretation of high-throughput experimental and clinical data. The core data structure 
is the annotation, an association between a gene product and a term from one of the three ontologies comprising 
the GO. Historically, it has not been possible to provide additional information about the context of a GO term, such 
as the target gene or the location of a molecular function. This has limited the specificity of knowledge that can be 
expressed by GO annotations. 

Results: The GO Consortium has introduced annotation extensions that enable manually curated GO annotations 
to capture additional contextual details. Extensions represent effector-target relationships such as localization 
dependencies, substrates of protein modifiers and regulation targets of signaling pathways and transcription factors 
as well as spatial and temporal aspects of processes such as cell or tissue type or developmental stage. We describe 
the content and structure of annotation extensions, provide examples, and summarize the current usage of 
annotation extensions. 

Conclusions: The additional contextual information captured by annotation extensions improves the utility of 
functional annotation by representing dependencies between annotations to terms in the different ontologies of 
GO, external ontologies, or an organism's gene products. These enhanced annotations can also support 
sophisticated queries and reasoning, and will provide curated, directional links between many gene products to 
support pathway and network reconstruction. 
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Background 

Comprehensive representation of the roles of gene prod- 
ucts, individually and in combination, is essential to the 
understanding and modeling of biological systems. In 
addition to a gene products intrinsic activity, aspects of 
the context in which it acts, such as the gene products it 
acts upon, subcellular location of the activity, distribution 
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in cell or tissue types, or temporal restrictions to a cell 
cycle phase or developmental stage, must be described in 
order to obtain a full description of its biological role. 

The Gene Ontology (GO) is a bioinformatics resource 
that uses structured controlled vocabularies (ontologies) 
to describe the molecular functions or activities of a gene 
product, the biological processes in which a gene product 
is involved and the cellular components in which a gene 
product is located. Associations or annotations' can be 
made between ontology terms and specific genes or gene 
products using a variety of manual or algorithmic methods 
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that rely upon experimental evidence or sequence similar- 
ity, for example, to support the assertion [1-4]. 

While the ontological rigor of the GO vocabularies has 
been enriched over the years through the use of expressive 
formalisms that permit logical reasoning and interaction 
with external ontologies [5], the annotations themselves 
have, until now, remained simple declarative statements. 
Each GO annotation is essentially a pair, combining a sin- 
gle gene product with a single GO term, plus supporting 
metadata such as the evidence for the association [6]. Fur- 
thermore, any gene product can be associated with many 
GO terms, and likewise any GO term could be used to 
annotate any number of gene products, the annotations 
thus coded remain independent. The simplicity of this 
core GO annotation model has facilitated the population 
of large annotation datasets, but this simplicity has, as 
well, been unable to capture the interconnections between 
multiple annotations to multiple genes, resulting in limita- 
tions on the granularity and connectivity of information 
that could be captured. Figure 1 illustrates this by showing 
a subset of Molecular Function and Cellular Component 
annotations to several gene products, including micro- 
somal glutathione transferase 1 [7]. While the annotations 
can describe which activities the gene products can 
perform, and in which components they are located, there 
is no way of combining this information to convey which 
activities are performed in which locations. 

Guidelines for pre-composing ontology terms 

In the core GO annotation model, adding new terms to 
the ontology, or "pre-composing" terms, has traditionally 
captured additional biological detail. However, we have set 
limits on how specific terms may be differentiated from 
one another. For example, generally we do not add new 
terms for activities or processes that are identical apart 
from which specific genes or gene products they affect. 

To illustrate this, consider the two terms 'regulation 
of transcription from RNA polymerase II promoter 
(GO:0006357) and 'regulation of Sonic hedgehog tran- 
scription from RNA polymerase II promoter. The sec- 
ond term would not be added to the Biological Process 
ontology because the only difference between it and its 
parent is the target of the regulation; the core process 
represented by the term is mechanistically no different 
from analogous processes governing transcription of 
other genes. Although for the purpose of understand- 
ing the biology it is important to capture information 
about gene products that specifically regulate the tran- 
scription of Sonic hedgehog, it would not be practical 
to create a specific transcription regulation term in the 
Biological Process ontology for every regulation target 
in a genome. 

We also want to avoid pre-composing GO terms that 
combine many concepts, or whose term label is very 



long, making it difficult for humans to easily interpret 
their meaning. 

To enable curators to create flexible, meaningful GO 
annotations at the time of annotation that represent a 
more complete picture of gene product roles in their 
biological context, we have introduced annotation exten- 
sions to the GO annotation model. Curators can add detail 
to GO annotations using controlled vocabularies (either 
GO or external ontologies, such as Cell Type Ontology 
(CL) [8]; Uber Anatomy Ontology (Uberon) [9] or Plant 
Ontology (PO) [10]) and biological entities such as genes 
or their products. GO annotations with extensions thus 
incorporate an increased level of detail and biological inte- 
gration, supporting more sophisticated querying and ana- 
lysis. We have applied this model to the curation of gene 
products from species such as mouse, human and fission 
yeast and are proceeding to implement it throughout the 
GO Consortium. 

Here we describe how annotation extensions have been 
incorporated into the GO annotation system, summarize 
the relationship types we use for extensions, and provide 
examples of how extensions can be displayed and applied, 
using a corpus of annotations we have developed. 

Results 

Extending basic annotations with relationships 

We extended the core GO annotation model to accom- 
modate annotation extensions. The annotation extension 
model is described formally in terms of the Web Ontology 
Language (OWL) in the 'Methods' section. Conceptually, 
we take existing GO terms such as 'protein kinase activity 
(GO:0004672) or 'nucleus (GO:0005634) and describe a 
more specific subtype through the use of one or more 
formal relationships to other entities (such as the protein 
that is the target of the kinase, or the cell type which the 
nucleus is a part of). This is logically equivalent to creat- 
ing a new term for the subtype in the ontology. 

An extended annotation is an annotation to a GO term 
followed by one or more relational expressions (extensions). 
Each relational expression is written as Relation(Entity), 
where Relation is a label denoting a relationship type, and 
Entity is an identifier for a database object or ontology 
term. Each such expression can be thought of as refining 
the core GO term used. For example, the Entity identifier 
for 'keratinocyte (CL:0000312) from the Cell Type Ontol- 
ogy (CL) can be combined with the Relation ( part_of to 
create the expression "part_of(CL:0000312)", and when 
combined with the GO term 'nucleus (GO:0005634) now 
describes a gene product that localizes to the nucleus of a 
keratinocyte. 

Relations 

We created an application ontology that extends the OBO 
(Open Biomedical Ontologies) Relations Ontology (RO) 
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Figure 1 Representation of annotations made using the core GO annotation model. Gene products can be annotated to several GO terms, 
and any GO term can be used to annotate any number of gene products, but the annotations remain independent. The stars indicate an 
annotation of a gene product to the GO term and each colour represents a single gene product. Using this simple GO annotation model, it is not 
clear from the annotations shown in which Cellular Component each of the protein activities are performed. For example, microsomal glutathione 
S-transferase 1 is represented by the red star and can perform two activities; glutathione transferase activity and glutathione peroxidase activity. It is 
found in three Cellular Components, the mitochondrion, endoplasmic reticulum and peroxisomal membrane, but from these annotations the 
knowledge that the glutathione transferase activity is performed in the mitochondrion [7] cannot be found. For clarity not all annotations of 
each gene product are shown nor all terms between the specified terms and the root node. 



[11] with a set of relations created explicitly for use in GO 
annotation extensions. These were selected and defined 
for practical use by iterative discussion among curators, 
and are collected in a file maintained in OBO format [12], 
and also available in OWL format [13]. To enable curators 
to select the appropriate relation we have created a graph- 
ical web view (Figure 2; [14]), and organized relations into 



subsets by entity type (for example, all relations where a 
chemical entity can be specified are grouped in the 
'chemical' subset). 

The set of relations used fall into two broad categories - 
molecular relations, which take an entity such as a gene, 
gene product, complex or chemical as an argument; and 
contextual relations, which take an entity such as a cell 
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Figure 2 Graphical web view of the annotation extension relations, (a) A graphical view of the relations was created to assist curators in 
selecting the appropriate relation for curation [14]; (b) A user can zoom into a particular area of the graph; (c) Relations can be clicked to view 
information, such as which GO terms the relation can be used with and which identifiers may be used with the relation. 



type, anatomy term, developmental stage or a GO term as 
an argument. Table 1 lists the most frequently used anno- 
tation extension relations with examples of their usage. 

Entities 

Identifiers used for the entities in annotation extensions 
can reference GO or another ontology or database. Each 
identifier must have a prefix found in the GO Database 
Abbreviations file [15], for example "UniProtKB" (protein 
database) [16], "CHEBI" (chemical database) [17], "CL" 
(cell type ontology) [8], "Uberon" (metazoan anatomy 
ontology) [9], or "PO" (plant anatomy ontology) [10]. A 
gene product identifier used in an annotation extension, 
should be interpreted in the context of the primary GO 
term used. For example, the inclusion of the gene iden- 
tifier SGD:S000004660 in the annotation extension field 
associated with the GO term 'protein phosphorylation 



should be interpreted as "the protein product of SGD: 
S000004660 is phosphorylated". 

Combining multiple extensions 

In this new system, a single GO annotation can have 
multiple relational expressions associated with it, where 
each expression uses a single relation and a single entity. 
Multiple expressions using the same relation are permitted. 
For example, if a gene product can carry out its activity in 
multiple locations or during various processes, multiple 
Relation(Entity) pairs may be added as separate annotation 
extensions. 

To illustrate, consider a gene product that has its ac- 
tivity in a neuron of the hippocampus. Here it would 
be appropriate to make an extension combining two 
expressions for both the cell type (neuron) and the 
gross anatomical structure (hippocampus). If this gene 



Table 1 Most commonly used relationships for annotation extension statements and examples of their usage 

Contextual relationships Example (gene product; primary GO term; annotation extension) 

part_of 
occurs_in 
happens_during 



C. elegons psf-1; nucleus; part_of(WBbt:0006804 body wall muscle cell) 

Mouse opsin-4; G-protein coupled photoreceptor activity; occurs_in(CL:0000740 retinal ganglion cell) 
S. pombe wis4; stress-activated MAPK cascade; happens_during(GO:0071470 cellular response to osmotic stress) 



Molecular relationships 

has_regulation_target 

has_input 
has_direct_input 



Example (gene product; primary GO term; annotation extension) 

Human suppressor of fused homolog SUFU; negative regulation of transcription factor import into nucleus; 
has_regulation_target(UniProtKB:P08151 zinc finger protein GUI) 

S. pombe rlf2; protein localization to nucleus; has_input(PomBase:SPAC26H5.03 pcf2) 

Human WNK4; chloride channel inhibitor activity; has_direct_input(UniProtKB:Q7LBE3 Solute carrier family 26 member 9) 



Molecular relations take an entity such as a gene, gene product, complex or chemical as an argument; contextual relations take an entity such as a cell type, 
anatomy term, developmental stage or a GO term as an argument. Entity names in italics are shown for clarity and are not part of the annotation 
extension format. 
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product also had the same activity in an epithelial cell, 
this expression could be combined in the same annotation 
extension. 

A description of the semantics of multiple extensions 
used in the annotation extension model can be found in 
the 'Methods' section. 

Annotation extensions in curation to specify molecular 
targets 

Schizosaccharomyces pombe protein Nepl illustrates 
how annotation extensions can be used to represent the 
multiple targets of a gene products enzymatic activity. 
Nepl is a protease that can deneddylate proteins modi- 
fied by Nedd8 [18]. It has been shown to deneddylate 
three cullin proteins, Cull, Cul3 and Pcu4 (Figure 3a). 
Using the core GO annotation model described above, 



it was not possible to record the cullins as the targets 
of the deneddylation activity of Nepl. The annotation 
would be: 

( NEDD8-specific protease activity' (GO:0019784) with 
the evidence code Inferred from Mutant Phenotype' 
(IMP) (Figure 3b). 

Using annotation extensions the annotation can 
be enriched as follows: Nepl is annotated to the term 
( NEDD8-specific protease activity' (GO:0019784) with the 
evidence code IMP, and with several Relation(Entity) pairs 
specifying the gene product targets of the activity: 

has_direct_input(PomBase:SPAC17G6.12)| 
has_direct_input(PomBase:SPAC24H6.03)| 
has_direct_input(PomBase:SPAC3A11.08) 
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Figure 3 The deneddylation activity of S. pombe Nepl. (a) The 

experimental data reported in [18] is interpreted as: Nepl is capable 
of deneddylating the cullins Cull, Cul3 and Pcu4. (b) and (c) 
Graphical representation of Nepl annotations using (b) the core GO 
annotation model and (c) the extended GO annotation model. 



(See 'Methods' section for a description of the semantics 
used in the annotation extension model). 

This annotation means that a nepl mutant phenotype 
(IMP evidence in [18]) indicates that Nepl executes 
NEDD8-specific protease activity and can deneddylate 
Cull (SPAC17G6.12), Cul3 (SPAC24H6.03) and Pcu4 
(SPAC3A11.08) (Figure 3c). We use the relation has_dir- 
ect_input here with a Molecular Function term to indicate 
the effector-substrate relationship between the gene prod- 
uct and its target protein. The PomBase display of the 
Nepl annotation is shown in Figure 4, note the annotation 
extension relation names have been translated to more 
human-readable text [19]. 

Annotation extensions in curation to specify locational 
context 

To illustrate how annotation extensions may be used to 
specify locational context, we use the example of the rat 
signaling complex subunit, mAKAP. mAKAP has been 
shown by immunocytochemical assay to be located on 
the nuclear envelope of cardiomyocytes [20] . 

With the core annotation model, we are only able to 
capture the cellular compartment that mAKAP is located 
in 'nuclear envelope (GO:0005635) with the evidence code 
Inferred from Direct Assay (IDA). 

Using the annotation extension model as follows, we 
can capture the cellular and anatomical context of the 
location of mAKAP such that mAKAP is annotated to 
the term 'nuclear envelope (GO:0005635) with the evidence 
code IDA and with two Relation(Entity) pairs specifying the 
cell and tissue locations of the nuclear envelope: 

part_of(CL:0002495), part_of(UBERON:0002082) 

This annotation means that a direct assay (immunocyto- 
chemical assay in [20]) has shown that rat mAKAP is 
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Figure 4 Display of annotation extension data in PomBase for S. pombe Nepl gene product. Annotation of the observation that Nep1 
deneddylates the three cullins Cull, Cul3 and Pcu4 [18] requires one annotation with three separate expressions in the annotation extension. 
Note that more human-readable text has been substituted for the annotation extension relation names for display purposes in PomBase. The 
underlying data retain the relation names, and the mapping between relation names and display text is available on the PomBase website [19]. 



located at the nuclear envelope of 'fetal cardiomyocytes 
(CL:0002495) of the 'cardiac ventricle (UBERON:0002082). 

Interconversion of core GO annotations and annotation 
extensions 

As annotation extensions are a relatively new feature of 
GO curation, we have described and implemented methods 
that allow legacy tools (i.e. those that do not have support 
for extensions in their data models) to use extended 
annotations without loss of specificity [21]. We have also 
implemented reverse methods that allow the conversion 
of basic GO annotations to extended annotations. We 
informally call these methods 'folding' and unfolding' re- 
spectively, and these make use of the OWL formalization 
of the GO (see 'Methods' section for details). The folding 
operation creates a new application ontology on the fly, 
with each extended annotation materializing a new GO 
term. An OWL reasoner is used to automatically con- 
struct the graph in this new ontology. Application of 
this method can be seen as a stopgap to allow continued 
use of existing tools - the resulting application ontology, 
whilst logically complete, may be unwieldy for querying 
and browsing. The unfolding method takes annotations to 
existing highly specific GO terms, and replaces them with 
an annotation to a more basic GO term, with the equiva- 
lent additional information now expressed as extensions. 
Unfolding annotations is useful for reducing the complexity 
of GO terms when querying or browsing. 

Discussion 

Practical application of annotation extensions 

Several member groups of the GO Consortium are now 
producing extended annotations to enrich their dataset. 



A summary of the numbers of extended annotations 
categorized by species is shown in Table 2. 

Currently there are few applications, databases or 
browsers that make use of, or display, extended anno- 
tations. In addition to their inclusion in the annotation 
files, extended annotations are currently displayed in 
the GO Consortium browser, AmiGO 2 [22] and on the 
PomBase gene information pages [23] and there are 
plans to display them in UniProt-GOAs GO browser, 
QuickGO [24], and on WormBase gene pages [25]. Outside 
of the GO Consortium, Ensembl Genomes [26] now display 
annotation extensions for S. pombe genes and these can be 

Table 2 Extended annotations categorized by species 



Species Total no. manual No. extended % extended 

annotations annotations annotations 



Mus musculus 


409098 


25209 


6.2 


Homo sapiens 


219258 


9042 


4.1 


Saccharomyces 
cerevisiae 


53750 


2713 


5.0 


Schizosaccharomyces 
pombe 


29049 


1902 


6.5 


Caenorhabditis elegans 


27488 


1102 


4.0 


Arabidopsis thaliana 


101936 


503 


0.5 


Rattus norvegicus 


72280 


477 


0.7 


Escherichia coli 


11658 


426 


3.7 


Dictyostelium 
discoideum 


19278 


228 


1.2 


Drosophila 
melanogaster 


109886 


214 


0.2 



The number of extended annotations is shown compared to the total number 
of manual annotations for each species. Calculated with the statistics from the 
UniProt-GOA database [3] on 21 November 2013. 
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used for querying annotation sets in the Ensembl Fungi 
BioMart [27]. 

As extension data becomes more widely available, 
querying for functional information can become more 
sophisticated. Users of the GO will be able to query the 
annotations for a wealth of specific information, including 
connections between a gene product and other entities 
and processes, or the locations — at the subcellular level 
as well as cell and tissue types — where a gene product 
performs specific roles. For example, a user could query 
for all targets of a particular protein kinase, or compose a 
more specific query to find all the proteins that are in- 
volved in blood vessel remodeling during retina vascula- 
ture development in the camera-type eye. Annotation 
extensions capturing effector-target relationships at the 
cellular level will provide a rich source of directional in- 
formation for regulatory network reconstruction. For 
instance, the has_input and has_direct_input relations 
can be used to connect signal transducing components 
of signaling pathways or to link DNA binding regulatory 
transcription factors with their specific target genes. 
The inherent directionality encoded in the extension can 
also be used to increase the information content of existing 
interaction-based networks. Annotation extensions can also 
assist with improving the interpretations of pathway ana- 
lysis. Currently pathway analysis, which uses methods such 
as term enrichment and pathway topology, is hampered by 
the lack of functional annotation with associated contextual 
aspects such as cell or tissue type or dependencies on other 
gene products or substances [28]. GO has the potential to 
enable great advances in pathway analysis by providing this 
contextual information in annotation extensions. 

Pre- vs. post-composition of GO terms 

As described above, increased specificity of GO annotations 
has historically been achieved by adding new, more specific 
ontology terms. However, new term addition cannot ac- 
commodate every detail that would be desirable to capture 
in GO annotations. 

Using annotation extensions to increase annotation 
specificity is logically equivalent to creating new terms 
in the ontology (see 'Methods' section), but allows a 
more streamlined approach for information capture at 
the time of annotation. Extended annotations can be 
'folded' to create a logical equivalent of a GO term, re- 
gardless of whether the term is included in the ontology. 
GO terms that are included in the ontology are said to 
be pre-composed; whereas the combination of terms and 
annotation extensions effectively post-compose' a term. It 
is also possible to perform the inverse and unfold' pre- 
composed GO terms into the equivalent extended annota- 
tion expression (see 'Methods' section). Whether terms 
are pre- or post-composed during the annotation process 
is thus not critical because it is possible to interconvert 



seamlessly between the two. Identical information can 
thus be captured by either of two routes, creation of a 
new pre-composed term or during the recording of an 
annotation. 

Although many details captured in annotation extensions 
will remain outside the scope of GO terms indefinitely, GO 
developers will investigate systems by which annotation ex- 
tensions can be automatically converted to pre-composed 
terms when certain criteria are met, for example where a 
certain number of annotations have identical extensions 
and the pre-composed term is in scope. The new terms will 
be added to the ontology using logical definitions that make 
them equivalent to the post-compositional annotation. 
Annotations made previously using post-composition can 
be processed to the new pre-composed terms. 

In the future, maintaining a good balance between 
pre- and post-composition will be assisted by automated 
methods to reason over annotations enhanced with annota- 
tion extensions to ensure the annotations are consistently 
grouped by an appropriate common ancestor GO term. 

Impact on users of Gene Ontology annotation 

The GO Consortium will provide annotation extension 
data as unfolded annotations, i.e. in the annotation files, 
the annotation extension will be kept in a separate field 
to the primary annotation. Consumers of annotation data 
can therefore choose to be unaffected by annotation ex- 
tensions by simply ignoring the additional field. However, 
we do hope that users and tool developers will incorporate 
the extensions into their tools and workflows to provide 
additional specificity to their queries and tools. For ex- 
ample, a term enrichment tool provider might provide 
an option to fold the annotation extensions into pre- 
composed terms before a user performs term enrich- 
ment. A GO browser could be extended to include an 
option to search folded annotation extensions as well 
as regular GO terms, e.g. it would be possible to search 
for all gene products that are involved in epithelial cell 
differentiation, whether or not the cell type was curated 
using the specific GO term or in the annotation extension 
with the more general GO term cell differentiation'. A 
basic query for a GO term will necessarily find the annota- 
tions to that term (and its child terms) with and without 
extensions, the user may choose whether or not to use the 
extension data. 

We encourage users and tool developers to contact us with 
specific questions so we can assist them with using this data. 

Future developments 

A longer-term goal of the GO Consortium is to link an- 
notations together to fully describe the directionality and 
dependencies in a whole pathway or process. Although 
annotation extensions are not sufficient to represent 
complete biological pathways, they provide a valuable 
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set of data that future work can build upon. A more ex- 
pressive annotation system is now under development 
within the GO Consortium, which will allow curators 
to join annotations sourced from different publications 
and with different supporting evidence to describe entire 
pathways or sub-processes. The annotation extensions 
currently being captured will feed directly into the new 
modular annotation system [29]. 

Conclusions 

GO annotation extensions have been introduced to en- 
hance the depth and utility of annotation data by capturing 
specific contextual information regarding a gene products 
function or location. Curators can now create, on-the-fly, 
complex GO annotations that describe dependencies and 
consequences of a gene product s function or location more 
completely than was previously possible. Data curated using 
annotation extensions provides a repository for experimen- 
tally verified regulation targets for a wide range of gene 
products, including transcription factors and microRNAs, 
information that is currently not captured by other stan- 
dardized annotation approaches. A large corpus of anno- 
tations now make use of annotation extensions, and this 
number is growing rapidly as groups make use of powerful 
curation tools such as UniProt-GOAs Protein2GO [30]. 
Extensive annotation enhancement makes GO data more 
informative for a biologists understanding of a gene or 
process of interest, and provides additional value to the 
data which can be used by GO analysis tool providers to 
enhance the interpretation of high-throughput datasets, 
such as those created by next generation sequencing, tran- 
scriptomic and proteomic studies. 

Methods 

Annotation Extension Model 

Annotation extensions are a means of dynamically referring 
to subtypes of existing GO terms, by means of sets of 
relation-value pairs, connected via either "and" or "or" op- 
erators (represented in GO annotation files using "," and 
"|", respectively). 

We present a formal treatment of the GO annotation 
extension model and the syntax used to write extensions. 
This formal underpinning is necessary to clarify the seman- 
tics of annotation extensions and to enable the use of auto- 
mated reasoners to perform useful computations. However, 
the details of the formal underpinnings can be hidden in 
tools used by curators and end-users, and instead presented 
in intuitive ways. 

Formalization 

We formalize the annotation extension model in terms of 
Description Logics, and in particular the Web Ontology 
Language (OWL) [31]. The GO is already heavily axioma- 
tized in OWL [32]. In the core GO annotation model, an 



annotation is an association between a gene or gene prod- 
uct G and an OWL Class C. C is restricted to be a class 
from one of the three sub-ontologies of the GO: Molecu- 
lar Function, Biological Process, or Cellular Component. 
The meaning of the association varies depending on which 
of these three sub-ontologies are used - there are a num- 
ber of ways of formalizing this in OWL, however, we do 
not provide details here as this is not in the scope of the 
extensions provided in this manuscript. 

The GO annotation extension model is formally a relax- 
ation of the core model, in that it allows the annotation to 
be to any OWL Class Expression that conforms to the fol- 
lowing profile. 

ClassExpression ::= Class | ObjectlntersectionOf 
(Class RelationalExpression+) 

RelationalExpression ::= ObjectSomeValuesFrom 
(ObjectProperty Class) 

For a description of the constructs used in the above, 
please see the OWL2 syntax and semantics document [33]. 
The main language constructs used are (1) intersections, 
which are interpreted as set-intersection (2) existential 
restrictions ("some values from") which correspond to 
standard relationships such as those found in the GO 
and (3) object properties, also known as relations. 

It can be seen that annotation extensions form a subset 
of the EL++ profile [34], which thus allows the use of fast 
reasoners such as Elk [35]. This is important for the GO, 
which contains large numbers of annotations. 

One consequence of this model is that the external 
entities, being related, must be modelled as OWL clas- 
ses rather than OWL individuals. In practice this is not 
a limitation, as molecular entities such as proteins are 
typically modelled as classes [36]. 

Syntax 

Annotation extensions can be expressed in a backwards 
and forwards compatible extension to existing exchange 
formats such as Gene Association Format (GAF); GAF 
2.0 extends GAF 1.0 by providing an additional column 
(position 16) in which to write a set of relational expres- 
sions, as defined above. This column is optionally filled 
with a disjunctive expression conforming to the following 
Bachus Normal Form (BNF) grammar: 

AnnotationExtension ::= 
RelationalExpressionConj unction { 
" | "RelationalExpressionConj unction } 

RelationalExpressionConj unction ::= 
RelationalExpression {"," RelationalExpression } 
RelationalExpression ::= RelationSymbol "(" ClassID ")" 
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A disjunction is equivalent to multiple independent 
annotations each consisting of a conjunctive expression. 

The conjunctive expression is translated to an OWL 
intersection expression whose elements are the main 
GO class being annotated together with all relational ex- 
pressions in the conjunction. Each relational expression 
is translated to an OWL existential restriction ("some 
values from"). The Relation Symbol is translated to an 
Object Property from the Relations Ontology, and the 
ClassID is translated to an OWL class, both according 
to the mapping provided in the OBO format document 
[37]. To precisely specify the semantics of multiple 
extensions in output files, the annotation formats pro- 
vided by the GO Consortium force the use of either 
the comma character (",") or the pipe character ("|") to 
separate each expression, where the comma indicates 
conjunction (AND) and the pipe indicates disjunction 
(OR). 

For example, an annotation to the term 'nuclear en- 
velope (GO:0005635) with an extension field filled 
with: 

part_of(CL:0002495), part_of(UBERON:0002082) 

(where the CL identifier denotes cardiomyocyte and the 
Uberon identifier denotes cardiac ventricle) is translated 
to be an annotation to the OWL class expression: 

GO_0005635 and (BFO_0000050 some CL_0002495) 
and (BFO_0000050 some UBERON_0002082) 

(where the BFO (Basic Formal Ontology) identifier denotes 
the part_of relation). 

These expressions can be used by OWL reasoners to 
return guaranteed valid and complete answers to queries 
such as "find all annotations to classes that are part of a 
cell nucleus and part of a heart". 

The syntax does not allow nesting of expressions, but 
the use of parentheses in the grammar allows for the 
introduction of nesting in the future. 

Property Chains 

The set of object properties used can be primitive 
relations (such as part_of, occur s_in or regulates) or 
relations defined via an object property chain. This 
effectively allows for a limited level of nesting in the 
annotated OWL class expression, extending the profile 
described above to: 

RelationalExpression ::= ObjectSomeValuesFrom 
(ObjectProperty ClassOrRelationalExpression) 

ClassOrRelationalExpression ::= Class | 
RelationalExpression 



For example, if a relation expression of regulates_oc- 
curs_ m(CL:0000540) is used, this is equivalent to an 
OWL class expression 

regulates some (occurs_in some CL_0000540) 

Based on the definition of regulates_occurs_in < - > 
regulates o occurs_in. 

These chains can be expanded in user-views - for ex- 
ample, AmiGO 2 will show the expression above as 
"regulates . occurs_in : neuron". 

Automated validation using reasoning 

We use the Elk reasoner to reason over annotation class 
expressions in order to make sure they are logically co- 
herent according to constraints encoded in the OWL 
version of the GO, the relations ontology (RO; [11]), and 
external ontologies. For example, an annotation to a 
nonsense class expression that contains occurs_in some 
apoptosis is flagged because the reasoner computes that 
this expression is unsatisfiable, due to the constraint that 
the range of occurs_in is a continuant (i.e. non-process). 

We also use reasoning to automatically deepen anno- 
tations to class expressions to the Most Specific Class 
(MSC) in the ontology. For example, if a gene product is 
annotated to 'postsynaptic density (GO:0014069) and has 
the extension field filled with "part_of{CL:0000127)" , this 
is directly translated to the class expression 'postsynaptic 
density' and part_of some astrocyte which is inferred to 
have the MSC G 0:0097483 ('glial cell postsynaptic dens- 
ity) based on equivalence axioms in the GO [5]: 'glial cell 
postsynaptic density' EquivalentTo 'postsynaptic dens- 
ity' and part_of some 'glial cell' and the axiom astro- 
cyte' SubClassOf 'glial cell' inferred from the Cell Type 
Ontology. 

These reasoner checks and deepening procedures are 
performed by the GO Continuous Integration server [38]. 
We translate Gene Association Files into OWL using 
OWLTools [39]. 

Annotation folding and unfolding procedure 

We define a process of annotation folding that takes as 
input the GO plus a set of supporting ontologies together 
with a set of extended annotations and generates as out- 
put an additional ontology plus a set of basic annotations, 
where the input and output are logically equivalent [21]. 
For each extended annotation a to a term t and extension 
expression e, we replace this with an annotation a! to a 
term tr, where tr is added to the application ontology, 
with an equivalence axiom t A EquivalentTo (t and e). A 
fast OWL reasoner such as Elk is used to automatically 
classify the application ontology. The completeness of the 
classification is related to the proportion of classes in the 
core GO ontology that have equivalence axioms. 
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The converse procedure of annotation unfolding takes 
as input the GO plus a set of supporting ontologies 
together with a set of basic annotations and generates as 
output a simplified GO plus a set of extended annotations. 
For each annotation a to a term £, if the term t has an 
equivalence axiom in the GO to an expression (f and e), 
where f is a GO term and e conforms to an extension 
expression, then replace a with a new annotation a\ 
where t is replaced by f and the extension field is filled 
with e. 

Curation procedures 

Annotation extensions are created as part of the manual 
curation process [6]. This involves biological database 
curators reading full text, peer-reviewed articles to obtain 
information about gene product functions, the processes 
in which they are involved and their subcellular locations 
[1-4]. Curators choose GO terms that describe these 
aspects of a gene product and assign an evidence code 
that is appropriate for the type of supporting experiment 
or statement in the paper. The GO annotations and any 
annotation extension information are entered into the 
annotating groups' curation tool for inclusion in their 
database and/or display on their website. On a periodic 
basis, each group submits their file(s) of annotations 
for display on the GO Consortium website [40] and ftp 
site [41]. 

Annotation extensions are formatted as Relation 
(Entity) - where 'Entity is an identifier in an ontology or 
database, expressed as 'DB:ID' - in the current GO anno- 
tation file format (GAF2.0, column 16) [42] and in the 
new format Gene Product Association Data (GPAD, col- 
umn 11) [43]. The DB prefix must be listed in the GO 
Database Abbreviations collection [15]. 

Data availability and resources 

Annotation extensions can be represented in the two 
GO Consortium-supported annotation formats, GAF 2.0 
[42] and GPAD [43]. These files are housed on the Gene 
Ontology Consortium website [40]. 

Annotation extension data is available in AmiG02 
[22] and for S. pombe genes is additionally displayed on 
the PomBase gene pages [44] and in the Ensembl Fungi 
BioMart [27]. 

Further documentation on annotation extensions can 
be found on the GO Consortium website [45]. 
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